{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ['CUDA_VISIBLE_DEVICES'] = ''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ubuntu/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/ubuntu/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "import malaya"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "files = ['test-translated-cnn-daily.jsonl', \n",
    "         'train-translated-cnn-daily.jsonl'\n",
    "        ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "with open(files[0]) as fopen:\n",
    "    for l in fopen:\n",
    "        data = json.loads(l)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxlen = 900"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Peter Morris hancur ketika adiknya, Claire, meninggal dalam kemalangan kereta. Hanya setahun sebelumnya, dia telah berkahwin dengan Malcolm Webster. Dia percaya adiknya telah meninggal dalam kemalangan tragis. Perlu dua dekad kebenaran untuk dinyatakan. Claire dibunuh oleh Webster sehingga dia dapat memperoleh insurans hayatnya. Berusaha melakukan perkara yang sama kepada isteri keduanya pada tahun 1999 - dan dipenjara pada tahun 2011.'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['translated_highlight']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cut_article(article, maxlen = 900):\n",
    "    splitted = malaya.text.function.split_into_sentences(article, minimum_length = 2)\n",
    "    temp = []\n",
    "    for s in splitted:\n",
    "        temp.append(s)\n",
    "        if len(' '.join(temp).split()) > maxlen:\n",
    "            break\n",
    "    \n",
    "    return ' '.join(temp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"Peter Morris was devastated when his sister, Claire, died in a car crash when she was just 32. Only a year before, he had walked her down the aisle to give her away to her husband, Malcolm Webster, from Surrey, then aged 33, who seemed equally bereft at the loss of his wife. Peter, from Gillingham, Kent, believed Claire had died in a tragic accident that happened in Aberdeen, Scotland, in 1994. Scroll down for video . Peter Morris, left, was devastated when his sister, Claire, pictured right on her hen do, died in a car crash . It took two decades for the truth to be revealed - Claire had in fact been murdered by Webster so he could cash in on her life insurance. Webster had drugged his wife at their home in the Scottish village of Tarves, where they had settled after their wedding, before taking her for a drive. He then deliberately crashed their car and, after making his own escape, set it on fire so Claire had no hope of survival. Last year Claire's story was turned into an ITV drama starring Sheridan Smith. And now Peter has gone in front of the cameras to talk about his grief and to reveal the shocking detail of his sister's death, which he is still coming to terms with 21 years later. 'The fire brigade said it was the most intense fire they had ever seen from a mile away, the flames were leaping 20, 30 feet, it was just ridiculous,' Peter reveals on the latest episode of Britain's Darkest Taboos. 'A very brave fireman tried to retrieve what was left of Claire. There wasn't much left of her, poor love. ' Peter said he was shocked and heartbroken when he was told his sister, a qualified nurse, had lost her life in just terrible circumstances. Claire and Malcolm Webster pictured on their wedding day in 1993, the following year he murdered her . He had no reason to suspect Webster, who Claire had met at a party, had drugged and then burned her alive. Claire had been besotted with him and was delighted when they tied the knot in 1993. Peter recalls: 'He seemed like the perfect gentleman which is exactly what she would have wanted. He was a sort of Colin Firth type character. ' He said he and Webster consoled one another at the funeral and Peter believed he was suffering just as much as he was. 'He didn't say a lot at the funeral, he appeared to be in a complete state of shock as well,' he said. 'We were all standing round the graveside, in my right hand I was holding on to Malcolm's left hand. 'I was in floods of tears, and I looked at Malcolm, and he, he was also in floods of tears. And that more than anything else convinced me that it had been a genuine accident. ' For years Peter continued to believe the car crash had been accidental, and while some officers who investigated the scene of the crash had their misgivings, they had no concrete evidence to suggest anything different. Felicity Drumm, pictured on Britain's Darkest Taboos, was Webster's second wife. He attempted to kill her in circumstances similar to Claire's death . Peter lost touch with his brother-in-law when Webster moved to New Zealand - but it was there that his true colours were revealed when he tried to murder again. He married Felicity Drumm and they had son together. Unbeknownst to Felicity, he started faking her signature to take out life insurance polices in her name and also began poisoning her like he had done to Claire. Felicity also appears on Britain's Darkest Taboos and reveals how it felt to be drugged. She said: 'These episodes would often begin with a feeling of blurred vision or double vision and unsteadiness on my feet, my speech was sometimes become a little bit slurred, I would stagger around as if I was drunk and then would fall into a deep sleep. ' She couldn't believe her husband could be the cause of her illness but a pattern was forming. 'These events occurred after he had prepared a meal or a drink for me, they were always at the weekends, never inconveniently when I had to be able to go to work,' she said. Then in circumstances similar to Claire's death, in 1999 she was driven by Webster to a remote location when she was feeling tired and drowsy. It seems Webster could have been about to commit murder again - until he was interrupted by a phone call from Felicity's concerned parents. Felicity recalls how Webster was angry that the call had woken her up and urged her to ignore it but she persuaded him to drive her home so she could see her father. Webster then fled while Felicity learned some shocking truths from her parents. Webster pictured in 2011 when he was jailed for life after being found guilty of murder . 'We discovered that he'd opened up a private mail box, he had redirected all of our mail including bank mail which had prevented me from discovering that gradually the account was being fleeced,' she said. 'There were enquiries to real estate agents back in the Cornwall and Devon area suggesting he would be returning shortly to the UK with his infant son.\""
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cut_article(data['article'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Peter Morris hancur ketika adiknya, Claire, meninggal dalam kemalangan kereta ketika dia baru berusia 32 tahun. Hanya setahun sebelumnya, dia telah berjalan menyusuri lorong untuk memberikannya kepada suaminya, Malcolm Webster, dari Surrey, yang ketika itu berusia 33 tahun, yang sepertinya sama-sama bingung dengan kehilangan isterinya. Peter, dari Gillingham, Kent, percaya Claire telah meninggal dalam kemalangan tragis yang berlaku di Aberdeen, Scotland, pada tahun 1994. Tatal ke bawah untuk video. Peter Morris, kiri, hancur ketika adiknya, Claire, yang digambarkan tepat di ayamnya, meninggal dalam kemalangan kereta. Perlu dua dekad kebenaran untuk dinyatakan - Claire sebenarnya telah dibunuh oleh Webster sehingga dia dapat memperoleh insurans hayatnya. Webster telah meminum isterinya di rumah mereka di kampung Tarves di Scotland, di mana mereka telah menetap setelah perkahwinan mereka, sebelum membawanya untuk memandu. Dia kemudian sengaja merempuh kereta mereka dan, setelah melarikan diri, membakarnya sehingga Claire tidak mempunyai harapan untuk bertahan hidup. Tahun lalu kisah Claire diubah menjadi drama ITV yang dibintangi oleh Sheridan Smith. Dan sekarang Peter telah pergi ke depan kamera untuk membicarakan kesedihannya dan untuk mengungkapkan perincian mengejutkan kematian adiknya, yang masih dia hadapi 21 tahun kemudian. \"Pasukan bomba mengatakan itu adalah kebakaran paling kuat yang pernah mereka lihat dari jarak satu batu, api melonjak 20, 30 kaki, itu hanya tidak masuk akal,\" Peter mendedahkan pada episod terbaru Taboos Paling Gelap di Britain. \"Seorang anggota bomba yang sangat berani cuba mengambil apa yang tersisa dari Claire. Tidak banyak yang tersisa dari cintanya yang buruk. \"Peter mengatakan dia terkejut dan patah hati ketika diberitahu adiknya, seorang jururawat yang berkelayakan, telah kehilangan nyawanya dalam keadaan yang mengerikan. Claire dan Malcolm Webster bergambar pada hari perkahwinan mereka pada tahun 1993, pada tahun berikutnya dia membunuhnya. Dia tidak mempunyai alasan untuk mengesyaki Webster, yang telah bertemu Claire di sebuah pesta, telah meminum dadah dan kemudian membakarnya hidup-hidup. Claire telah dikelilingi dengannya dan gembira ketika mereka mengikat tali pada tahun 1993. Peter ingat: \\'Dia kelihatan seperti lelaki yang sempurna yang sebenarnya dia mahukan. Dia adalah sejenis watak jenis Colin Firth. \"Dia mengatakan bahawa dia dan Webster saling berkonsol di pengebumian dan Peter percaya dia menderita sama seperti dia. \"Dia tidak banyak mengatakan di pengebumian, dia juga kelihatan dalam keadaan terkejut,\" katanya. \"Kami semua berdiri di sekitar kubur, di tangan kanan saya memegang tangan kiri Malcolm\". Saya dalam keadaan menangis, dan saya memandang Malcolm, dan dia, dia juga dalam keadaan menangis. Dan itu lebih daripada yang lain meyakinkan saya bahawa ia adalah kemalangan yang sebenarnya. \"Selama bertahun-tahun Peter terus percaya bahawa kemalangan kereta itu tidak sengaja, dan sementara beberapa pegawai yang menyiasat tempat kejadian mengalami keraguan, mereka tidak mempunyai bukti konkrit untuk menunjukkan sesuatu yang berbeza. Felicity Drumm, yang digambarkan pada Taboo Tergelat Britain, adalah isteri kedua Webster. Dia cuba membunuhnya dalam keadaan yang serupa dengan kematian Claire. Peter kehilangan hubungan dengan kakak iparnya ketika Webster pindah ke New Zealand - tetapi di sana warna sebenarnya terungkap ketika dia cuba membunuh lagi. Dia berkahwin dengan Felicity Drumm dan mereka mempunyai anak bersama. Tanpa diketahui Felicity, dia mula memalsukan tandatangannya untuk mengeluarkan polisi insurans hayat atas namanya dan juga mula meracuninya seperti yang dilakukannya kepada Claire. Felicity juga muncul di Taboo Tergelat Britain dan mengungkapkan bagaimana rasanya dibius. Dia berkata: \\'Episode-episod ini sering kali dimulai dengan perasaan penglihatan kabur atau penglihatan berganda dan ketidakstabilan di kaki saya, ucapan saya kadang-kadang menjadi sedikit kabur, saya akan terus berkeliaran seolah-olah saya mabuk dan kemudian akan tidur nyenyak. \"Dia tidak percaya suaminya boleh menjadi penyebab penyakitnya tetapi corak terbentuk. \"Kejadian-kejadian ini berlaku setelah dia menyediakan makanan atau minuman untuk saya, mereka selalu pada hujung minggu, tidak pernah menyusahkan ketika saya harus pergi bekerja,\" katanya. Kemudian dalam keadaan yang serupa dengan kematian Claire, pada tahun 1999 dia didorong oleh Webster ke lokasi terpencil ketika dia merasa letih dan mengantuk. Nampaknya Webster mungkin akan melakukan pembunuhan lagi - sehingga dia terganggu oleh panggilan telefon dari ibu bapa Felicity yang berkenaan. Felicity ingat bagaimana Webster marah kerana panggilan itu telah membangunkannya dan mendesaknya untuk mengabaikannya tetapi dia memujuknya untuk membawanya pulang sehingga dia dapat melihat ayahnya. Webster kemudian melarikan diri sementara Felicity mengetahui beberapa kebenaran yang mengejutkan dari ibu bapanya. Webster bergambar pada tahun 2011 ketika dia dipenjara seumur hidup setelah didapati bersalah membunuh. \"Kami mendapati bahawa dia telah membuka kotak surat peribadi, dia telah mengarahkan semua surat kami termasuk surat bank yang telah menghalang saya untuk mengetahui bahawa secara beransur-ansur akaun itu melarikan diri,\" katanya. \"Terdapat pertanyaan kepada ejen harta tanah kembali di daerah Cornwall dan Devon yang menunjukkan bahawa dia akan kembali tidak lama lagi ke UK bersama anak lelakinya yang masih kecil. Di bahagian bawah beg bimbitnya terdapat sejumlah sembilan polisi insurans hayat, yang secara kolektif melebihi satu juta dolar atas kematian saya, dan mereka mempunyai tandatangan saya dipalsukan pada mereka. \"Dengan sedih, mereka juga menemui tabung petrol di dalam keretanya dan Felicity yakin dia hampir melarikan diri dari kematian pada hari itu. Webster kembali ke UK sementara Felicity melaporkannya kepada polis New Zealand. Setelah menyiasat, mereka mendesak polis Scotland untuk membuka semula kes itu ke kematian Claire. Pegawai utama dalam kes itu ialah Charles Henry. Dia berkata: \\'Kemalangan jalan raya yang asli sangat mencurigakan dan menimbulkan semua ciri khas dari apa yang berlaku di New Zealand. Pada ketika itu kami menghadapi individu yang sangat berbahaya yang menjadi ancaman kepada mana-mana wanita yang akan dia bersama. \\'Felicity digambarkan pada tahun 2011 ketika dia memberikan bukti terhadap Webster semasa perbicaraan di mana dia dijuluki \\'balu hitam\\' Namun, polis akan mengambil masa bertahun-tahun untuk membina kes yang cukup kuat untuk membuktikan kematian Claire bukan kemalangan dan mendapat bukti yang cukup untuk menuduh Webster.'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cut_article(data['translated_article'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test-translated-cnn-daily.jsonl\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "11490it [00:31, 369.96it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train-translated-cnn-daily.jsonl\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "287219it [14:33, 329.00it/s]\n"
     ]
    }
   ],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "with open('train.json', 'w') as fopen_jsonl:\n",
    "    for f in files:\n",
    "        print(f)\n",
    "        with open(f) as fopen:\n",
    "            for l in tqdm(fopen):\n",
    "                data = json.loads(l)\n",
    "                en = cut_article(data['article'])\n",
    "                ms = cut_article(data['translated_article'])\n",
    "                tgt = data['translated_highlight']\n",
    "                \n",
    "                d = {\"translation\": {\"src\": en, \"tgt\": tgt, 'prefix': 'ringkasan: '}}\n",
    "                fopen_jsonl.write(f'{json.dumps(d)}\\n')\n",
    "                d = {\"translation\": {\"src\": ms, \"tgt\": tgt, 'prefix': 'ringkasan: '}}\n",
    "                fopen_jsonl.write(f'{json.dumps(d)}\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "val-translated-cnn-daily.jsonl\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "13368it [00:46, 287.36it/s]\n"
     ]
    }
   ],
   "source": [
    "with open('test.json', 'w') as fopen_jsonl:\n",
    "    for f in ['val-translated-cnn-daily.jsonl']:\n",
    "        print(f)\n",
    "        with open(f) as fopen:\n",
    "            for l in tqdm(fopen):\n",
    "                data = json.loads(l)\n",
    "                en = cut_article(data['article'])\n",
    "                ms = cut_article(data['translated_article'])\n",
    "                tgt = data['translated_highlight']\n",
    "                \n",
    "                d = {\"translation\": {\"src\": en, \"tgt\": tgt, 'prefix': 'ringkasan: '}}\n",
    "                fopen_jsonl.write(f'{json.dumps(d)}\\n')\n",
    "                d = {\"translation\": {\"src\": ms, \"tgt\": tgt, 'prefix': 'ringkasan: '}}\n",
    "                fopen_jsonl.write(f'{json.dumps(d)}\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -n 4000 test.json > test-4k.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cp ../translation/run_t5.py ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
